Airbnb Data Analysis Project

Project Summary

This project dives into the diverse world of Airbnb rental prices, attempting to predict them based on available data. With over 40,000 Airbnb listings across European cities, each listing is a mix of different features like number of bedrooms, location, and reviews, making pricing quite complex. The main question for this analysis: Can we predict Airbnb prices based on other variables?

Visualization of Property Costs per Night (Euros €)

[Images: heatmaps for Locations 1 to 10]

Tableau Visualizations

Cost of accommodation per night, Euros(€) (Predicted Sum vs. Real Sum)

[Images: heatmaps for Locations 1 to 10]

I wanted to build a linear regression model to predict accommodation prices. Because there were around 50,000 data points, I knew I had to use a programming language. Python was the natural choice, as I already had considerable experience with it and it is more than capable of handling a data set of this size. The other sensible option would have been R.

I used the pandas and scikit-learn libraries to build the model. Once the model was built, I used Python to calculate the predicted price for each data point. I then exported the original data along with the predictions and created the plot above in Tableau. The model's R-squared is 0.949.
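The workflow described above (fit, predict, export) might look roughly like the sketch below. It is not the project's actual code: the data is synthetic, and the column names (`bedrooms`, `dist_to_centre_km`, `guest_satisfaction`, `real_sum`, `predicted_sum`) and the output file name are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the listings data. The real data set and its
# column names are not shown in this README, so these are hypothetical.
rng = np.random.default_rng(0)
n = 500
listings = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=n),
    "dist_to_centre_km": rng.uniform(0.2, 10.0, size=n),
    "guest_satisfaction": rng.uniform(60, 100, size=n),
})
listings["real_sum"] = (
    80 * listings["bedrooms"]
    - 12 * listings["dist_to_centre_km"]
    + 1.5 * listings["guest_satisfaction"]
    + rng.normal(0, 20, size=n)
)

# Fit an ordinary least squares model on the chosen features.
feature_cols = ["bedrooms", "dist_to_centre_km", "guest_satisfaction"]
model = LinearRegression()
model.fit(listings[feature_cols], listings["real_sum"])

# Add the prediction for every row next to the original data.
listings["predicted_sum"] = model.predict(listings[feature_cols])

# R-squared of the fit (on the same rows the model was trained on).
r_squared = r2_score(listings["real_sum"], listings["predicted_sum"])
print(f"R-squared on the synthetic data: {r_squared:.3f}")

# Export original and predicted values together for plotting in Tableau.
listings.to_csv("airbnb_with_predictions.csv", index=False)
```

Keeping the predictions in the same table as the original columns means Tableau can plot Predicted Sum against Real Sum directly from one file.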

Residual Plot, Euros(€) (Real Sum v Residual)

[Image: residual plot]

Points above the reference line indicate overestimation by the model (the predicted price is higher than the actual price), while points below represent underestimation (the predicted price is lower than the actual price).

For data that fits the model well, you would expect the residual plot to be symmetrical about a residual of zero and to show no clear trend. The plot above is neither. The model clearly overestimates properties that cost more than 600 euros, and the bias toward overestimation grows as the Real Sum increases, in a roughly linear fashion. Despite an R-squared value of 0.949, this pattern suggests that the model is not fully capturing the underlying relationship between the variables.
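One way to check this kind of pattern numerically is to average the residuals within price bands. The sketch below assumes the convention implied by the plot description, residual = predicted minus actual, so a positive value means overestimation. The five rows are made-up illustrative numbers, not values from the project data, and the column names are hypothetical.

```python
import pandas as pd

# Made-up example rows for illustration only (not project data).
results = pd.DataFrame({
    "real_sum": [120.0, 250.0, 480.0, 650.0, 900.0],
    "predicted_sum": [115.0, 262.0, 470.0, 700.0, 1010.0],
})

# Assumed convention: residual = predicted - actual,
# so positive residuals are overestimates.
results["residual"] = results["predicted_sum"] - results["real_sum"]

# Group listings into price bands and average the residual in each band.
bands = pd.cut(
    results["real_sum"],
    bins=[0, 300, 600, float("inf")],
    labels=["0-300", "300-600", "600+"],
)
mean_residual_by_band = results.groupby(bands, observed=True)["residual"].mean()
print(mean_residual_by_band)
```

For a well-specified model the band averages should all sit close to zero. A band average that drifts steadily away from zero as the price rises is the same signal as the trend visible in the residual plot.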

The next step is to experiment with polynomial regression, as the evidence suggests that the model is underfitting.
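As a rough idea of what that experiment could look like, the sketch below compares a plain linear fit with a degree-2 polynomial fit using scikit-learn's `PolynomialFeatures` in a pipeline. The data is synthetic with a deliberately curved relationship, and the single feature, the degree, and the train/test split are all assumptions. The right degree for the Airbnb data would need to be chosen by validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data with a curved (quadratic) relationship plus noise.
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=(400, 1))
y = 50 + 5 * x[:, 0] + 4 * x[:, 0] ** 2 + rng.normal(0, 10, size=400)

# Hold out a test set so the comparison is not made on training data.
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# Baseline: straight-line fit.
linear_model = LinearRegression().fit(X_train, y_train)

# Polynomial regression: expand the features to degree 2, then fit
# the same linear model on the expanded features.
poly_model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
).fit(X_train, y_train)

# score() returns R-squared on the held-out rows.
linear_r2 = linear_model.score(X_test, y_test)
poly_r2 = poly_model.score(X_test, y_test)
print(f"linear R-squared: {linear_r2:.3f}, polynomial R-squared: {poly_r2:.3f}")
```

Comparing the two models on held-out rows, and re-checking the residual plot afterwards, helps distinguish a genuinely better fit from one that only looks better because of the extra terms.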